A sample that does not represent the intended population and can lead to distorted findings.
Bin Limits (see also intervals)
The endpoints of the intervals into which numeric data is sorted. The right-hand bin limit of one interval is the left-hand bin limit of the next interval. Used with histograms.
Category
The generic term used for either class or interval (bin) regardless of whether the data is characterized by name-descriptions or number-values.
Classes
A discrete (not connected) way of grouping data characterized by a name-description (e.g. for the characterization: gender, classes are M,F). Used with bar charts (column diagrams).
Contingency Table
A table that shows the actual or relative frequencies of two types of data at the same time.
Continuous Random Variable
A random variable that can assume any numerical value within an interval.
Cumulative Distribution Function CDF
A function F(x) that sums the probabilities for values of the random variable that are less than or equal to x.
Data
An observation or the set of observations of a single characteristic for each member of a sample or population. This characteristic can be grouped by classes if it is a name-description or into intervals if it is a number-value.
Dependent Sample
The observation from one sample is related to an observation from another sample.
Dependent Variable
Regression analysis: The second variable in the regression equation that may be influenced by (dependent on) the independent (first) variable. Functions: The "outside" variable that depends on the "inside" independent variable(s).
Descriptive Statistics
Used to summarize or display data so that we can quickly obtain an overview. Includes measures of central tendency and dispersion.
Direct Observation
Gathering data while the subjects of interest are in their natural environment.
Discrete Random Variable
A random variable that assumes only separate, countable values (typically integers). The number of values may be finite or countably infinite.
Degrees of Freedom
The sample size minus 1, i.e. n-1. Used to calculate measures of dispersion of a sample and the TTest statistic.
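As a minimal sketch of where n-1 enters, the Python snippet below computes a sample variance by hand and compares it with the standard library. The function name `sample_variance` and the data values are illustrative assumptions, not taken from the glossary's sources.

```python
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    degrees_of_freedom = n - 1
    return sum((x - mean) ** 2 for x in data) / degrees_of_freedom

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))       # hand-rolled, divides by n - 1
print(statistics.variance(data))   # library sample variance, also n - 1
print(statistics.pvariance(data))  # population variance, divides by n
```

The last line is included only for contrast: a population measure divides by the full count, a sample measure by the degrees of freedom.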
Empirical Rule
If a distribution follows a bell-shaped, symmetrical curve centered around the mean, we would expect approximately 68%, 95%, and 99.7% of the values to fall within one, two, and three standard deviations of the mean respectively.
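The three percentages can be checked against the normal distribution. The sketch below assumes the bell-shaped curve is exactly normal and uses Python's `statistics.NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```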
Expected Frequencies
The number of observations that would be expected for each category (bin, class) of a frequency distribution, assuming the null hypothesis is true with chi-squared analysis.
Stochastic Experiment
An experiment in which all the possible outcomes are known, but the outcome of each run of the experiment is unknown and independent of all previous runs. (Ex. Throw a die.)
Expected Value E(x)
The sum of the values of the random variable multiplied by their respective probabilities. (Corresponds directly to the weighted mean, i.e. the mean of a relative frequency distribution.)
Experiment
The process of measuring or observing an activity for the purpose of collecting data.
Frequency Table (Distribution)
A table that shows the number of data that fall into specific classes or intervals.
Focus Group
An observational technique where the subjects are aware that data is being collected. Businesses use this type of group to gather information in a group setting that is controlled by a moderator.
Goodness-of-Fit Test
Uses a sample to test whether a frequency distribution fits the predicted distribution.
Histogram
A bar graph showing the frequency distribution, that is, the number of observations in each class as the height of each bar.
Independent Sample
The observations in Sample B are not related to the observations in Sample A.
Independent Variable
Regression analysis: The first variable in the regression equation that may influence the dependent (second) variable. Functions: The "inside" variable that is used to calculate the "outside" dependent variable.
Inferential Statistics
Used to make claims or conclusions about a population based on a sample of data from that population.
Interquartile Range
Measures the spread of the center between quartiles Q1 and Q3 of a data set and is used to identify outliers (see also Box and whiskers diagram).
Intervals (see also bins)
A continuous (connected) way of grouping data that is characterized by number-values (e.g. test scores). The intervals are of equal length and cover the range; the right endpoint is open so that data on the endpoint is sorted into the next interval. Used with histograms.
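A minimal sketch of the sorting rule described above, assuming equal-width intervals that are closed on the left and open on the right. The function names, the test scores and the bin limits are hypothetical.

```python
def bin_index(value, lower, width, bins):
    """Index of the half-open interval [lower + i*width, lower + (i+1)*width) holding value."""
    index = int((value - lower) // width)
    if not 0 <= index < bins:
        raise ValueError(f"{value} lies outside the binned range")
    return index

def frequency_table(data, lower, width, bins):
    """Count how many data values fall into each interval."""
    counts = [0] * bins
    for value in data:
        counts[bin_index(value, lower, width, bins)] += 1
    return counts

scores = [50, 59, 60, 61, 75, 80, 99]
# Intervals: [50,60), [60,70), [70,80), [80,90), [90,100)
print(frequency_table(scores, lower=50, width=10, bins=5))
```

A score of exactly 60 sits on a bin limit and is counted in the interval that starts at 60, not the one that ends there. In this sketch a value equal to the last right-hand limit (100) raises an error; how to treat it is a design choice the glossary does not settle.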
Interval Estimate
A measure that provides an interval, that is a range of values that characterizes a population.
Joint Probability
The probability of the intersection of two events.
Law of Large Numbers
This law states that when an experiment is conducted a large number of times, the empirical probabilities of the process will converge to the classical probabilities.
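The convergence can be illustrated with a small simulation. This is only a sketch: the fair die, the fixed seed and the function name `empirical_probability` are assumptions chosen for the example.

```python
import random

def empirical_probability(throws, target=6, seed=0):
    """Relative frequency of `target` in `throws` simulated throws of a fair die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(throws) if rng.randint(1, 6) == target)
    return hits / throws

# The classical probability of any one face is 1/6.
for throws in (100, 10_000, 100_000):
    print(throws, empirical_probability(throws))
```

With more throws, the printed relative frequency should settle closer to 1/6.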
Mean $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
The average of the values in a data set. Their sum divided by their number. Mean is used with quantitative numeric data.
Measure
A value (number) that measures a characteristic of a data set. Often called a statistic.
Measures of Central Tendency
Refers to the measures such as mean, median, mode[1], quartiles which describe the "central" information about data in a sample. (See also Measures of dispersion.)
Measures of Dispersion
Refers to the measures such as variance and standard deviation which describe how concentrated (or dispersed) the data in a sample are.
Median m
The measure (value) for which half the observations in the data set are higher and half the observations are lower. Median is used with numeric data.
Mode
The name-description (observation) in the data set that occurs most frequently. Mode should only be used for non-numeric data.
Nominal Data
Data that is of qualitative, not quantitative, character. (Think: gender)
Nominal Level of Measurement
Lowest level of data where numbers are used only to identify a group or category. (Think: jersey numbers)
Observed Frequencies
The number of actual observations noted for each category of a frequency distribution with chi-squared analysis.
Ordinal Level of Measurement
This measurement has all the properties of nominal data with the added feature that we can rank the values from highest to lowest.
Outliers
Extreme values in a data set that (probably) should be discarded before analysis.
Parameter
A value (number) that measures a characteristic of a population. Also called a measure.
Percentile
Values that divide a sorted data set into 100 subsets, each with the same number of data (analogous to quartiles).
Point Estimate
A measure that provides a point (a single value) that characterizes a population. The mean is the most common point estimate.
Population
A set of similar objects or members with a common characteristic that can be qualified or quantified. The number of the set (cardinality) is denoted by N (capital letter).
Population size N (see sample size)
The number of data in the entire population, usually denoted by N.
Primary Data
Data that is collected by the person who eventually uses the data.
Probability Distribution Function PDF
The listing or graphing of all of the values of a discrete random variable and their corresponding probability.
Probability Density Function PDF
The function f(x) whose domain (independent variable) is the interval of a continuous random variable and whose range (dependent variable) corresponds to probability such that for any [a,b], $P[a \le X \le b] = \int_a^b f(x)\,dx$.
Qualitative Data
Information which uses descriptive terms to measure or classify something of interest. (see also Nominal data)
Quantitative Data
Information which uses numerical values to describe something of interest.
Quartile
Q1 (quartile 1) is the median of the data between the minimum (Q0) and the median (Q2); Q3 (quartile 3) is the median of the data between the median (Q2) and the maximum (Q4).
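A minimal sketch of this median-of-halves definition. One detail is an assumption, because the definition leaves it open: when the number of data is odd, the median itself is left out of both halves. The function name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q0, Q1, Q2, Q3, Q4) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]       # median itself excluded when n is odd
    upper_half = ordered[n - half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

q0, q1, q2, q3, q4 = quartiles([1, 3, 4, 6, 7, 8, 10, 12, 15])
print(q0, q1, q2, q3, q4)
print("interquartile range:", q3 - q1)
```

Other conventions (for example including the median in both halves, or interpolating) give slightly different Q1 and Q3 for small data sets.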
Random Variable
A function that assigns a numeric value to each possible outcome.
Range
Given number-value data, the range is the largest data minus the smallest data, i.e. range=max-min.
Ratio Level of Measurement
Level of data that allows the use of all four mathematical operations to compare values and has a true zero point.
Relative Frequency Table (Distribution)
A table that shows the decimal value or percentage of the frequencies of each category (class or interval) relative to the total number of data. (a.k.a. Normalized frequency table)
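A frequency table and its relative (normalized) counterpart can be sketched in a few lines of Python. The survey answers and the function name `relative_frequencies` are invented for illustration.

```python
from collections import Counter

def relative_frequencies(observations):
    """Map each class to its share of the total number of observations."""
    counts = Counter(observations)  # the frequency table
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

answers = ["yes", "no", "yes", "yes", "no", "undecided", "yes", "no"]
print(Counter(answers))               # frequency table
print(relative_frequencies(answers))  # relative frequency table
```

The relative frequencies always add up to 1 (100%).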
Sample
A subset of a population. The number of the sample (cardinality) is denoted by n (small letter).
Sample measures (see also degrees of freedom)
Measures (statistics) based on a sample of the population.
Sample size n
The number of data in the sample, usually denoted by n.
Sample Space
The set of all possible outcomes of an experiment, that is, the set of all possible data observations in a population.
Sampling Error
An error which occurs when the sample measurement is different from the population measurement.
Secondary Data
Data that somebody else has collected and made available for others to use.
Simple Random Sample
A sample where every element in the population has an equal chance of being selected.
Standard Deviation σ
A measure of dispersion calculated by taking the square root of the variance.
Statistic
Data that describes a characteristic about a sample. Often used to estimate a parameter.
Statistics
The science that deals with the collection, tabulation, and systematic classification of quantitative data, especially as a basis for inference and induction.
Stratified Sample
A sample that is obtained by dividing the population into mutually exclusive groups or strata and randomly sampling from each of these groups.
Surveys
Data collection that involves directly asking a subject (member of the sample or population) a series of questions.
Systematic Sample
A sample where every kth member of the population is chosen for the sample, with the value of k being approximately N/n, where N is the size of the population and n is the size of the sample.
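A minimal sketch of the selection, assuming k is taken as the whole-number part of N/n and that the starting position is drawn at random from the first k members. The random start and the function name are assumptions, since the definition only fixes the step.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick every k-th member, k = N // n, from a random start among the first k."""
    N = len(population)
    k = N // n                                # approximately N/n
    start = random.Random(seed).randrange(k)  # random starting position in [0, k)
    return [population[start + i * k] for i in range(n)]

members = list(range(1, 101))  # N = 100
print(systematic_sample(members, n=10, seed=1))
```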
Variance
A measure of dispersion that describes the relative distance between the data points in the set and the mean of the data set.
Weighted Mean
The mean or expected value of data given with weights, i.e. frequencies or probabilities (e.g. Data: 0 with weight 0.6, 1 with weight 0.3 and 2 with weight 0.1 has a weighted mean of 0.5)
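The example in this entry can be reproduced with a short function. This is a sketch with a hypothetical name, `weighted_mean`; it divides by the sum of the weights so that frequencies and probabilities are handled the same way.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Weights given as probabilities (they sum to 1) ...
print(weighted_mean([0, 1, 2], [0.6, 0.3, 0.1]))
# ... or as frequencies (6, 3 and 1 observations out of 10).
print(weighted_mean([0, 1, 2], [6, 3, 1]))
```

Both calls describe the same distribution and give 0.5, up to floating-point rounding in the first case.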
Probability
Term (English)
Definition (English)
Addition Rule of Probabilities
Determines the probability of the union of two or more events. (If the events are mutually exclusive, the probabilities are added.)
Bayes' Theorem
A theorem used to calculate P[B|A] from information about P[A|B]. The term P[A|B] refers to the conditional probability of A given B, that is to the probability of Event A, given that Event B has occurred.
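As a hedged illustration, the sketch below computes P[B|A] from P[A|B], P[B] and P[A|not B], using the total probability of A as the denominator. The function name and all numbers are hypothetical.

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """Return P[B|A] from P[A|B], P[B] and P[A|not B]."""
    p_not_b = 1.0 - p_b
    # Total probability of A, split over B and not-B.
    p_a = p_a_given_b * p_b + p_a_given_not_b * p_not_b
    return p_a_given_b * p_b / p_a

# Hypothetical numbers: B = "part is defective", A = "inspection flags the part".
print(bayes(p_a_given_b=0.9, p_b=0.02, p_a_given_not_b=0.05))
```

Even with a 90% detection rate, most flagged parts in this made-up example are not defective, because defective parts are rare to begin with.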
Binomial Trial
An experiment that has only two possible outcomes for each trial called success and failure. The probability of success p is the same for each trial and so the probability of failure is q=1-p. A binomial trial is completely defined by knowing the value of p.
Binomial Experiment
An experiment in which one performs exactly n trials of a binomial trial with probability of success p. A binomial experiment is completely defined by the numbers n and p.
Classical Probability
Refers to situations in which we know the number of possible outcomes of the experiment being conducted and can therefore calculate probabilities by counting.
Combinations
The number of different ways in which objects can be arranged without regard to order (ABA=AAB).
Conditional Probability
The probability of Event A, knowing that Event B has already occurred. Denoted P[A|B]
Empirical Data
Data measured when actually performing an experiment or when generated using a random number generator (simulated real data).
Empirical Probability
The probability of an event occurring based on the relative frequency distribution of empirical data. Empirical probability is to theoretical probability as sample measures are to population measures.
Event
One or more outcomes of an experiment, i.e. a subset of the sample space.
Fundamental Counting Principle
A concept that states if one event can occur in m ways and another event can occur in n ways, then the total number of ways in which both events can occur together is m·n ways.
Independent Event
The occurrence of Event B has no effect on the occurrence of Event A.
Intersection (see also union)
All of the events in the event set occur. Key word: and, e.g. Throw a die, E1: number is even; E2: number >3; Intersection is {4,6}, i.e. number is both even and >3.
Multiplication Rule of Probabilities
Determines the probability of the intersection of two or more events. (If the events are independent, the probabilities are multiplied.)
Mutually Exclusive Events
When two events cannot occur at the same time during an experiment.
Outcome
A particular result of an experiment.
Permutations
The number of different ways in which the objects of a set can be arranged with regard to order (ABA≠AAB). (In the US, permutations include variations.)
Probability (p)
The likelihood that a particular event will occur. p is a number between 0 and 1 (between 0% and 100%) where p=0 means the event will never occur and p=1 means the event will always occur.
Sample Space Ω
In probability, the set of all possible mutually exclusive (elementary, independent) outcomes of an experiment.
Subjective Probability
The probability of an event occurring based on experience and intuition.
Theoretical Probability
The probability of an event occurring based solely on mathematics. The empirical probability will tend toward the theoretical probability given a (very) large number of experiments.
Union
At least one of the events in the event set occurs. Key word: or, e.g. Throw a die, E1: number is even; E2: number >3; Union is {2,4,5,6}, i.e. number is either even or >3.
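The die example used for intersection and union maps directly onto Python sets. This is an illustrative sketch; the variable names are assumptions.

```python
sample_space = {1, 2, 3, 4, 5, 6}             # throw a die
e1 = {n for n in sample_space if n % 2 == 0}  # E1: number is even
e2 = {n for n in sample_space if n > 3}       # E2: number > 3

intersection = e1 & e2  # both events occur ("and")
union = e1 | e2         # at least one event occurs ("or")

print(sorted(intersection))
print(sorted(union))
# With equally likely outcomes, probability = favourable outcomes / all outcomes.
print(len(union) / len(sample_space))
```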
Variations
The number of different ways in which a subset or superset of the objects of a set can be arranged with regard to order (ABA≠AAB).
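The counts described under Combinations, Permutations and Variations can be obtained from Python's standard library. The sketch assumes four distinct objects and arrangements without repetition; `math.perm(n, r)` with r smaller than n corresponds to what this glossary calls a variation.

```python
import math
from itertools import combinations, permutations

letters = "ABCD"
r = 2

print(math.perm(len(letters)))     # orderings of all 4 objects
print(math.perm(len(letters), r))  # ordered arrangements of 2 out of 4
print(math.comb(len(letters), r))  # unordered selections of 2 out of 4
print(["".join(p) for p in permutations(letters, r)])
print(["".join(c) for c in combinations(letters, r)])
```

Each unordered pair such as AB appears twice among the ordered arrangements (AB and BA), which is why the second count is twice the third.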
Statistical Charts, Diagrams, Plots
Term (English)
Definition (English)
Bar Chart
A data display where the value of the observation is proportional to the height of the bar on the graph. (No classes, bins, intervals.)
Box and Whiskers Chart
This chart displays the data in quartiles with vertical lines drawn at Q0=minimum, Q1, Q2=median, Q3 and Q4=maximum and a box drawn around Q1 to Q3 (the interquartile range).
Line Diagram
A display where ordered pair data points are connected together with a line.
Pie Chart
Chart used to describe data from relative frequency distributions with a circle divided into sectors whose area is proportional to the respective relative frequency.
Scatter Plot
Stem and Leaf Chart
This chart displays the frequency table by splitting each data value into a stem (class) made up of its leading digits and a leaf made up of its last digit. (By counting the leaves instead of recording their values, one has a frequency table.)
Distributions
Term (English)
Definition (English)
Binomial Probability Distribution
A discrete PDF that, given a specific number n of trials of a binomial experiment, graphs the probability that k successes, each with probability p, will occur over that number of trials. (memoryless)
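A minimal sketch of the binomial probabilities, using the standard formula C(n,k)·p^k·q^(n-k) with q=1-p. The function name `binomial_pmf` and the choice n=10, p=0.5 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P[exactly k successes in n trials], each with success probability p."""
    q = 1.0 - p
    return comb(n, k) * p**k * q**(n - k)

n, p = 10, 0.5
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(distribution[5])    # probability of exactly 5 successes
print(sum(distribution))  # all probabilities together
```

Summing over k = 0..n is a quick sanity check, since the probabilities of all possible success counts must total 1.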
Normal Probability Distribution
A continuous PDF that has a bell shaped graph. Key: According to the CLT, the means of a large number of samples, each with sample size >30 will follow a normal PDF.
Poisson Probability Distribution
A discrete PDF that given a specific period of time, graphs the probability that k number of events occur over that period of time. (memoryless)
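A minimal sketch of the Poisson probabilities, using the standard formula λ^k·e^(-λ)/k!. The parameter `rate` (λ, the expected number of events in the period) is an assumption of this example, as the entry does not name it.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """P[exactly k events] when `rate` events are expected in the period."""
    return rate**k * exp(-rate) / factorial(k)

rate = 2.0  # hypothetical: two events expected per period
for k in range(5):
    print(k, round(poisson_pmf(k, rate), 4))
```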
Hypothesis Testing (with one or two samples)
Term (English)
Definition (English)
Alternative Hypothesis
Denoted by H1, represents the claim and is the opposite of the null hypothesis (so >, < or ≠) and is accepted if the null hypothesis H0 is rejected.
Central Limit Theorem CLT
A theorem that states as the number of samples (each with size n>30) gets larger, the sample means of these samples tend to follow a normal probability distribution (regardless of the distribution type of the samples).
Confidence Interval
The range of values between the tails. It depends on the significance level α. Its center is the sample mean; its half-width is the margin of error.
Hypothesis
An assumption about a population parameter. With one sample or two dependent samples, it is an assumption about the population mean. With two independent samples, it is an assumption about the difference in the means of the two populations.
Hypothesis Testing
A systematic way of calculating parameters from a sample in order to determine (test) whether a hypothesis is true or not.
Confidence Level (1-α)
The probability that the confidence interval will include the true mean of the population. It equals the proportion that is not under the tail(s). A larger confidence level means the conclusion is more dependable.
Significance Level α
The proportion of the area under the tail(s), i.e. outside the confidence interval. Smaller values of α imply the conclusion is more dependable. Values are usually in the range 0.01≤α≤0.1.
Margin of Error
Half the width of a confidence interval. It depends on the significance level, the standard error and the test used.
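How the significance level, the standard error and the interval fit together can be sketched as follows. The example assumes a two-tailed interval for the mean with a known population standard deviation, so the normal distribution applies; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, alpha=0.05):
    """Two-tailed interval for the population mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, alpha/2 in each tail
    standard_error = sigma / sqrt(n)
    half_width = z * standard_error          # distance from the center to either end
    return sample_mean - half_width, sample_mean + half_width

low, high = z_confidence_interval(sample_mean=100.0, sigma=15.0, n=36, alpha=0.05)
print(round(low, 2), round(high, 2))
```

A smaller α (higher confidence level) or a larger standard error widens the interval; a larger sample narrows it.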
Null Hypothesis
Denoted by H0, this represents the status quo and involves stating the belief that the mean of the population is ≤, ≥ or = a given value. To prove a claim, you want to reject H0.
p-value or probability value
The largest level of confidence (or equivalently, the smallest level of significance α) at which the null hypothesis H0 can be rejected. A.k.a. Observed Level of Significance.
Observed Level of Significance
see p-value
One-Tail Hypothesis Test
This test is used when the alternative hypothesis H1 is being stated using inequalities (< or >). With one tail, we use α when calculating the test statistic.
Pooled Estimate of the Standard Deviation
Given two small samples, a weighted average of the two sample standard deviations. Used to calculate (estimate) the standard error of the mean when the standard deviation is not known.
Sample Standard Deviation s
Used to calculate (estimate) the standard error of the mean when the standard deviation σ of the population is not known.
=STDEV.S
Sampling Distribution of the Sample Mean
The distribution of the sample means of a large number of samples. The CLT says that this distribution is approximately the normal distribution.
Standard Deviation σ
The standard deviation of the population.
=STDEV.P
Standard Error of the Difference between Two Means
The error describes the variation in the difference between two sample means.
(standard deviations known)
(both samples are >30)
(all other situations)
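The formulas that belong to these three cases are missing from this entry. The usual textbook forms are sketched below as an assumption, not as the original wording. Here $\sigma_1,\sigma_2$ are the population standard deviations, $s_1,s_2$ the sample standard deviations, $n_1,n_2$ the sample sizes and $s_p$ the pooled estimate of the standard deviation (which presumes the two populations have equal variances).

```latex
\begin{align*}
\sigma_{\bar{x}_1-\bar{x}_2} &= \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}
  && \text{(standard deviations known)}\\
\sigma_{\bar{x}_1-\bar{x}_2} &\approx \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
  && \text{(both samples are $>30$)}\\
\sigma_{\bar{x}_1-\bar{x}_2} &\approx s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}},
  \quad s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}
  && \text{(all other situations)}
\end{align*}
```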
Standard Error of the Mean $\sigma_\mu$
Measures the standard deviation of the sample mean. Used to calculate ZTest or TTest.
$\sigma_\mu = \sigma/\sqrt{n}$ (standard deviation known)
$\sigma_\mu \approx s/\sqrt{n}$ (all other situations)
Standard Error of the Proportion σp
Definition: The standard deviation of the sample proportions.
Calculation:
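The formula itself is missing here. The standard form, given as an assumption and not recovered from the original, uses the proportion $p$ (in practice estimated by the sample proportion) and the sample size $n$:

```latex
\sigma_p = \sqrt{\frac{p\,(1-p)}{n}}
```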
TTest Statistic
Measures the minimum number of standard errors of the mean that the sample mean must be from the hypothetical mean before being allowed to reject the null hypothesis H0. Used when sample size <30 and depends on degrees of freedom and significance level α. Use the Student's t-distribution table to calculate.
Test Statistic
A measure from a sample used to decide whether or not to reject the null hypothesis.
Two-Tail Hypothesis Test
This test is used whenever the alternative hypothesis H1 is expressed as ≠. With two tails, we use α/2 when calculating the test statistic.
Type I Error
Occurs when the null hypothesis is rejected when, in reality, it is true.
Type II Error
Occurs when the null hypothesis is not rejected when, in reality, it is false.
ZTest Statistic
Measures the minimum number of standard errors of the mean that the sample mean must be from the hypothetical mean before being allowed to reject the null hypothesis H0. Used when sample size is >30 and depends on significance level α. Use the normal distribution table to calculate.
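A hedged sketch of a z-test for one sample mean: it computes how many standard errors the sample mean lies from the hypothetical mean and compares that with the critical value from the normal distribution (α in one tail, α/2 in each of two tails). It assumes a known standard deviation and, in the one-tail case, that the sample mean lies on the side named by H1. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, hypothesized_mean, sigma, n, alpha=0.05, tails=2):
    """Return (z, critical value, reject H0?) for a test of the population mean."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - hypothesized_mean) / standard_error
    tail_area = alpha / tails  # alpha for one tail, alpha/2 for two tails
    critical = NormalDist().inv_cdf(1 - tail_area)
    return z, critical, abs(z) > critical

z, critical, reject = z_test(sample_mean=103.0, hypothesized_mean=100.0,
                             sigma=15.0, n=100, alpha=0.05, tails=2)
print(round(z, 2), round(critical, 2), reject)
```

`NormalDist().inv_cdf` stands in for the normal distribution table lookup.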
Regression and Correlation
Coefficient of Determination, r²
Represents the percentage of the variation in y that is explained by the regression line.
Correlation Coefficient
Indicates the strength and direction of the linear relationship between the independent (x) and dependent (y) variables.
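A minimal sketch of the calculation, assuming the usual Pearson formula for paired (x,y) observations. The function name `correlation` and the data are illustrative.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4), round(r ** 2, 4))  # r and the coefficient of determination
```

The sign of r gives the direction of the relationship, and its distance from 0 gives the strength; r lies between -1 and 1.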
Least Squares Method
A mathematical procedure to identify the linear equation that best fits a set of ordered pairs by finding values for the slope a, and the y-intercept b. The goal of the least squares method is to minimize the total squared error between the observed values y and the predicted values ŷ.
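A minimal sketch of the procedure for a single independent variable, using the closed-form solution for slope a and y-intercept b. The function names and data are hypothetical.

```python
def least_squares(xs, ys):
    """Return slope a and y-intercept b of the best-fitting line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def squared_error(xs, ys, a, b):
    """Total squared error between observed y and predicted a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(round(a, 4), round(b, 4), round(squared_error(xs, ys, a, b), 4))
```

Any other choice of a and b gives a total squared error at least as large, which is what "best fits" means here.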
Simple Regression
A procedure that describes a straight line that best fits a series of ordered pairs (x,y).
Standard Error of the Estimate
Measures the amount of dispersion of the observed data around the regression line.
ANOVA
Term (English)
Definition (English)
Analysis of Variance (ANOVA)
A procedure to test the difference between more than two population means.
Completely Randomized One-Way ANOVA
An analysis of variance procedure that involves the independent random selection of observations for each level of one factor.
Factor
Describes the cause of the variation in the data for analysis of variance ANOVA.
Level ANOVA
The number of categories within the factor of interest in the analysis of variance procedure.
Mean Square Between (MSB)
A measure of variation between the sample means.
Mean Square Within (MSW)
A measure of the variation within each sample.
One-Way ANOVA
An analysis of variance procedure where only one factor is considered.
Randomized Block ANOVA
Analysis of variance procedure that controls for variations from other sources than the factors of interest.
Scheffe Test
This test is used to determine which of the sample means are different after rejecting the null hypothesis using ANOVA.
Sum of Squares Between (SSB)
The variation among the samples in ANOVA.
Sum of Squares Block (SSBL)
The variation among the blocks in ANOVA.
Sum of Squares Within (SSW)
The variation within the samples in ANOVA.
Total Sum of Squares
The total variation in ANOVA that is obtained by adding SSB to SSW.
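The sums of squares and mean squares defined in this section can be tied together in a short one-way ANOVA sketch. The divisors k-1 and N-k (degrees of freedom between and within) and the ratio F = MSB/MSW are standard textbook choices assumed here; the function name and data are hypothetical.

```python
def one_way_anova(samples):
    """Return SSB, SSW, SST, MSB, MSW and F for a list of samples."""
    k = len(samples)  # number of levels of the factor
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]

    # Variation among the sample means, weighted by sample size.
    ssb = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Variation of each observation around its own sample mean.
    ssw = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    sst = ssb + ssw
    msb = ssb / (k - 1)        # between-samples degrees of freedom
    msw = ssw / (n_total - k)  # within-samples degrees of freedom
    return {"SSB": ssb, "SSW": ssw, "SST": sst,
            "MSB": msb, "MSW": msw, "F": msb / msw}

result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print({name: round(value, 4) for name, value in result.items()})
```

A large F means the variation between the sample means is large compared with the variation within the samples.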
STATISTICS GLOSSARY
Sources
[1] Donnelly R. Statistics. 2nd Ed., The Complete Idiot's Guide, Alpha Publishing, 2007.
[2] http://mathcentral.uregina.ca/QQ/database/QQ.09.99/raeluck1.html (difference between bar charts and histograms)
[3] http://onlinestatbook.com/2/index.html
[1] Mode is included here although it is not a measure, but a name-description.